Abstract

Compressed inverted indices in use today are based on the idea
of gap compression: document pointers are stored in increasing
order, and the gaps between successive document pointers
are stored using suitable codes which represent smaller gaps
using fewer bits. Additional data such as counts and positions
is stored using similar techniques. A large body of research
has been built in the last 30 years around gap compression, in- 
cluding theoretical modeling of the gap distribution, special- 
ized instantaneous codes suitable for gap encoding, and ad 
hoc document reorderings which increase the efficiency of in- 
stantaneous codes. This paper proposes to represent an index 
using a different architecture based on quasi-succinct repre- 
sentation of monotone sequences. We show that, besides be- 
ing theoretically elegant and simple, the new index provides 
expected constant-time operations and, in practice, significant 
performance improvements on conjunctive, phrasal and prox- 
imity queries. 



1 Introduction 

An inverted index over a collection of documents contains, for 
each term of the collection, the set of documents in which the 
term appears and additional information such as the number 
of occurrences of the term within each document, and possi- 
bly their positions. Inverted indices form the backbone of all 
modern search engines, and the existence of large document 
collections (typically, the web) has made the construction of 
efficient inverted indices ever more important. 

Compression of inverted indices saves disk space, but more
importantly also reduces disk and main memory accesses [8],
resulting in faster evaluation. We refer the reader to the book
by Manning, Raghavan and Schütze [19] and to the very complete
and recent survey by Zobel and Moffat [27] for a thorough
bibliography on the subject.

Two main complementary techniques are at the basis of
index compression: instantaneous codes provide storage for
integers that is proportional to the size of the integer (e.g.,
smaller numbers use fewer bits); gap encoding turns lists of
increasing integers (for instance, the monotonically increasing
list of numbers of documents in which a term appears) into lists
of small integers, the gaps between successive values (e.g., the
difference). The two techniques, combined, make it possible
to store inverted indices in highly compressed form.
Instantaneous codes are also instrumental in storing in little space
information such as the number of documents in which each
term appears.

Since inverted indices are so important for search engines, 
it is not surprising that a large amount of research has studied 
how to maximize either the speed or the compression ratio of 
gap-encoded indices. Depending on the application, compres- 
sion or speed may be considered more important, and different 
solutions propose different tradeoffs. 

In this paper, we describe a new type of compressed index
that does not use gaps. Rather, we carefully engineer
and tailor to the needs of a search engine a well-known
quasi-succinct representation for monotone sequences proposed by
Peter Elias [13].¹ We explain how to code every part of the index
by exploiting the bijection between sequences of integers
and their prefix sums, and we provide details about the physical
storage of our format.

Our new index is theoretically attractive: it guarantees to 
code the information in the index close to its information- 
theoretical lower bound, and provides on average constant- 
time access to any piece of information stored in the in- 
dex, including searching for elements larger than a given 
value (a fundamental operation for computing list intersec- 
tions quickly). This happens by means of a very simple ad- 
dressing mechanism based on a linear list of forward pointers. 
Moreover, sequential scanning can be performed using a very 
small number of logical operations per element. We believe
it is particularly attractive for in-memory or memory-mapped 
indices, in which the cost of disk access is not dominant. 

To corroborate our findings, in the last part of the paper,
we index the TREC GOV2 collection and a collection of 130
million pages of the .uk web² with different types of encodings,
such as δ and Golomb. We show that, while not able to beat
gaps coded with Golomb codes, our index compresses better
than γ/δ codes or variable-length byte.

We then compare a prototype Java implementation of our
index against MG4J and Lucene, two publicly available
Java-based engines, and Zettair, a C search engine. MG4J has
been set up to use γ/δ codes, whereas Lucene and Zettair
use variable-length byte codes. We get a full confirmation
of the good theoretical properties of our index, with excellent
timings for conjunctive, phrasal and proximity queries. We
also provide some evidence that for pointer lists our index is
competitive with the Kamikaze implementation of PForDelta
codes [28].

The quasi-succinct indices described in this paper are the
default indices used by MG4J from version 5.0.³

¹ Incidentally, Elias also invented some of the most efficient codes for gap
compression [14].

² We remark that TREC GOV2 is publicly available, and that the latter
collection is available from the author.

2 Related work 

The basis of the current compression techniques for inverted 
indices is gap encoding, developed at the start of the '90s PI . 
Gap encoding made it possible to store a positional inverted 
index in space often smaller than the compressed document 
collection. Gaps (differences between contiguous document 
pointers in the posting list) have to be encoded using instan- 
taneous codes that use shorter codewords for smaller integers, 
and previous research in information theory provided γ, δ [14]
and Golomb [15] codes, which achieve excellent compression.
Moreover, a wealth of alternative codes has been developed
in the last 30 years.⁴

When speed is important, however, such codes are rather
slow to decode: in practice, implementations often use the
folklore variable-length byte code (e.g., the open-source search
engine Lucene, as well as Zettair). Recent research has developed
a number of word-aligned codes (e.g., [2]) that encode in
a single machine word several integers, providing high-speed
decoding and good compression. In [28], the authors tailor
their PForDelta code to the behavior of modern super-scalar
CPUs and their caches.

More specialized techniques tackle specific problems, 
studying in great detail the behaviour of each part of the index:
for instance, [26] studies in great detail the compression
of positional information. 

Another line of research studies the renumberings of the 
documents that generate smaller gaps. This phenomenon is 
known as clustering [20], and can be induced by choosing a 
suitable numbering for the documents [5, 24].

As indices became larger, a form of self-indexing ED be- 
came necessary to compute quickly the intersection of lists of 
documents, an operation that is at the basis of the computation 
of conjunctive Boolean queries, proximity queries and phrasal 
queries. 

The techniques used in this paper are based on a seminal paper
by Elias [13], which is a precursor of succinct data structures
for indexed sets [22]. We do use some of the knowledge
developed by the algorithmic community working on succinct
data structures, albeit in practice the theoretical encodings
developed there, which concentrate on attaining asymptotically
optimal speed using $o(n)$ additional bits, where $n$ is the
optimal size for the data structure, have presently too high
constant costs to be competitive in real applications with methods
using $O(n)$ additional bits.

We remark that the literature on the subject is actually im- 
mense, and impossible to recap in this section. The references 
above should be considered mostly as pointers. We refer the 
reader again to [19, 27] for a complete historical overview.

³ http://mg4j.di.unimi.it/

⁴ Alternative approaches, such as interpolative coding [20], have been
proposed to code some part of an index, but they lack the direct-access and
skipping features that are necessary for fast query resolution.



3 Definitions 

In this paper we discuss the indexing problem for a collection 
of documents. We give definitions from scratch as we will 
need to discuss formally the index content. 

Each document is represented by a number, called document
pointer, starting from zero. Each document $d$ has a
length $\ell$, and is formed by a sequence of terms $t_0, t_1, \ldots, t_{\ell-1}$.
For each document and each term, the count specifies how
many times a term appears in the sequence forming the doc- 
ument. The frequency is the number of documents in which 
a term appears (i.e., the number of documents for which the 
count is not zero). The occurrency of a term is the number 
of occurrences of the term in the whole collection, that is, the 
sum of the counts of the term over all documents. 

The posting list for a term is the (monotonically increasing) 
list of documents where the term appears. With each docu- 
ment we associate also the (nonzero) count of the term in the 
document, and the (monotonically increasing) list of positions 
(numbered from zero) at which the term appears in the given 
document. 

The unary code associates with the natural number $n \ge 0$
the codeword $0^n 1$. The negated unary code associates with
the natural number $n \ge 0$ the codeword $1^n 0$.
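
The two codes can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a bit array is modelled as a string of `0` and `1` characters, the implicit stream pointer is an explicit integer, and all function names are hypothetical.

```python
def unary(n):
    """Unary code: n zeroes followed by a one (the stop bit)."""
    return "0" * n + "1"

def negated_unary(n):
    """Negated unary code: n ones followed by a zero."""
    return "1" * n + "0"

def read_unary(bits, pos):
    """Read one unary code starting at pos; return (value, new position)."""
    stop = bits.index("1", pos)      # position of the stop bit
    return stop - pos, stop + 1

def read_negated_unary(bits, pos):
    """Read one negated unary code starting at pos; return (value, new position)."""
    stop = bits.index("0", pos)
    return stop - pos, stop + 1

stream = unary(3) + unary(0) + unary(2)   # "0001" + "1" + "001"
pos, values = 0, []
while pos < len(stream):
    value, pos = read_unary(stream, pos)
    values.append(value)
print(values)  # [3, 0, 2]
```

Each read returns the updated pointer, which mirrors the stream view of a bit array described in the next paragraph.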

A bit array of length $n$ is a sequence of bits $b_0, b_1,
\ldots, b_{n-1}$. We sometimes view such an array as a stream: we
assume that there is an implicit pointer, and that I/O opera- 
tions such as reading unary codes are performed by scanning 
the array and updating the implicit pointer accordingly. 

4 Quasi-Succinct Representation of 
Monotone Sequences 

In this section we give a detailed description of the high
bits/low bits representation of a monotone sequence proposed
by Elias [13]. We assume we have a monotonically increasing
sequence of $n > 0$ natural numbers

$$0 \le x_0 \le x_1 \le \cdots \le x_{n-2} \le x_{n-1} \le u,$$

where $u > 0$ is any upper bound on the last value.⁵ The choice
$u = x_{n-1}$ is of course possible (and optimal), but storing
explicitly $x_{n-1}$ might be costly, and a suitable value for $u$ might
be known from external information, as we will see shortly.
We will represent such a sequence in two bit arrays as follows:

• the lower $\ell = \max\{0, \lfloor \log(u/n) \rfloor\}$ bits of each $x_i$ are
stored explicitly and contiguously in the lower-bits array;⁶

• the upper bits are stored in the upper-bits array as a
sequence of unary-coded gaps.

In Figure 1 we show an example. Note that we code the
gaps between the values of the upper bits, that is,
$\lfloor x_i/2^\ell \rfloor - \lfloor x_{i-1}/2^\ell \rfloor$ (with the convention $x_{-1} = 0$).
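
The construction can be sketched as follows, again with the upper-bits array modelled as a string and the lower bits kept as a list of small integers rather than a packed array. The function name `ef_encode` is an assumption made for illustration; the example reproduces the list of Figure 1.

```python
def ef_encode(xs, u):
    """Split a monotone list xs (values <= u) into lower bits and unary-coded upper bits."""
    n = len(xs)
    # ell = max(0, floor(log2(u / n))), computed with integer arithmetic
    ell = max(0, (u // n).bit_length() - 1)
    lower = [x & ((1 << ell) - 1) for x in xs]
    upper, prev_high = [], 0
    for x in xs:
        high = x >> ell
        upper.append("0" * (high - prev_high) + "1")   # unary code of the gap
        prev_high = high
    return ell, lower, "".join(upper)

ell, lower, upper = ef_encode([5, 8, 8, 15, 32], 36)
print(ell, lower, upper)  # 2 [1, 0, 0, 3, 0] 0101101000001
```

The upper-bits array has 13 bits here: one stop bit per element plus $\lfloor 32/2^2 \rfloor = 8$ zeroes.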

⁵ If $u = 0$, the list is entirely made of zeroes, and its content is just defined
by $n$.

⁶ Actually, Elias discusses just the case in which $u + 1$ and $n + 1$ are
powers of two, but extending his definitions is an easy exercise.

Figure 1: A simple example of the quasi-succinct encoding
from [13]. We consider the list 5, 8, 8, 15, 32 with upper
bound 36, so $\ell = \lfloor \log(36/5) \rfloor = 2$. On the right, the lower
$\ell$ bits of all elements are concatenated to form the lower-bits
array. On the left, the gaps of the values of the upper bits are
stored sequentially in unary code in the upper-bits array.



The interesting property of this representation is that it uses
at most $2 + \lceil \log(u/n) \rceil$ bits per element: this can be easily
seen from the fact that each unary code uses one stop bit, and
each other written bit increases the value of the upper bits by
$2^\ell$: clearly, this cannot happen more than $\lfloor x_{n-1}/2^\ell \rfloor$ times.
But

$$\left\lfloor \frac{x_{n-1}}{2^\ell} \right\rfloor \le \left\lfloor \frac{u}{2^\ell} \right\rfloor \le \frac{u}{2^{\max\{0, \lfloor \log(u/n) \rfloor\}}} \le 2n. \qquad (1)$$

Thus, we write at most $n$ ones and $2n$ zeroes, which implies
our statement as $\lceil \log(u/n) \rceil = \lfloor \log(u/n) \rfloor + 1$ unless $u/n$ is
a power of two, but in that case (1) actually ends with $\le n$, so
the statement is still true.

Since the information-theoretical lower bound for a monotone
list of $n$ elements in a universe of $u$ elements is

$$\left\lceil \log \binom{u+n}{n} \right\rceil \approx n \log \frac{u+n}{n},$$

we see that the representation is close to succinct: indeed,
Elias proves in detail that this representation is very close to
the optimal representation (less than half a bit per element
away). Thus, while it does not strictly classify as a succinct
representation, it can be safely called a quasi-succinct
representation.⁷

To recover $x_i$ from the representation, we perform $i$ unary-code
reads in the upper-bits array, getting to position $p$: the
value of the upper bits of $x_i$ is then exactly $p - i$; the lower $\ell$
bits can be extracted with a random access, as they are located
at position $i\ell$ in the lower-bits array.
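
A minimal sketch of this retrieval on the list 5, 8, 8, 15, 32 (with $\ell = 2$) follows. It assumes one particular reading of the description: the $i$ unary-code reads skip the codes of $x_0, \ldots, x_{i-1}$, and $p$ is then taken to be the position of the stop bit that closes the code of $x_i$, so that $p - i$ zeroes precede it. The constants and the name `ef_get` are illustrative.

```python
ELL = 2
LOWER = [1, 0, 0, 3, 0]        # lower bits of 5, 8, 8, 15, 32
UPPER = "0101101000001"        # unary-coded gaps of the upper bits

def ef_get(i):
    """Return x_i by scanning the upper-bits array from its beginning."""
    pos = 0
    for _ in range(i):                   # skip the unary codes of x_0 .. x_{i-1}
        pos = UPPER.index("1", pos) + 1
    stop = UPPER.index("1", pos)         # stop bit of the code of x_i
    high = stop - i                      # zeroes to the left of that stop bit
    return (high << ELL) | LOWER[i]      # lower bits found by direct indexing

print([ef_get(i) for i in range(5)])  # [5, 8, 8, 15, 32]
```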

We now observe that, assuming we have a fictitious element
$x_{-1} = 0$, we can equivalently see the list $x_0, x_1, \ldots, x_{n-1}$ as
a list of natural numbers by computing gaps:

$$a_0 = x_0 - x_{-1},\ a_1 = x_1 - x_0,\ \ldots,\ a_{n-1} = x_{n-1} - x_{n-2}.$$

Conversely, given a list $a_0, a_1, \ldots, a_{n-1}$ of natural numbers
we can consider the list of prefix sums $s_k = \sum_{i=0}^{k-1} a_i$
for $0 \le k \le n$. The two operations give a bijective
correspondence between monotone sequences⁸ bounded by $u$ and
lists of natural numbers of the same length whose sum is
bounded by $u$.⁹ Thus, we can represent using the high bits/low
bits representation either monotonically increasing sequences,
or generic lists of integers.¹⁰

⁷ Actually, the representation is one of the ingredients of
sophisticated, modern succinct data structures that attain the information-theoretical
bound [22].
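
The correspondence can be checked on a small example. The helper names `gaps` and `prefix_sums` below are assumptions made for this sketch.

```python
from itertools import accumulate

def gaps(xs):
    """a_i = x_i - x_{i-1}, with the fictitious element x_{-1} = 0."""
    return [x - prev for prev, x in zip([0] + xs[:-1], xs)]

def prefix_sums(a):
    """s_0 = 0 and s_k = a_0 + ... + a_{k-1}; s_0 is not part of the bijection."""
    return [0] + list(accumulate(a))

xs = [5, 8, 8, 15, 32]
a = gaps(xs)
print(a)                   # [5, 3, 0, 7, 17]
print(prefix_sums(a)[1:])  # [5, 8, 8, 15, 32]
```

Dropping $s_0$ recovers the original monotone sequence, and the sum of the gaps equals its last element.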

The quasi-succinct representation above has a number of 
useful properties that make it quite advantageous over gap- 
encoded sequences: 

• The distribution of the document gaps is irrelevant: there
is no code to choose, because the lower bits are stored
explicitly in a fixed-width format, and the representation
of the upper bits, being made by $n$ ones and at most $2n$
zeroes, is a perfect candidate for the unary code.

• Compression is guaranteed irrespective of gaps being
well distributed (e.g., because of correlation between the
content of consecutive documents) or not. In particular,
renumbering documents in a way that improves retrieval
speed (e.g., to ease early termination) will not affect the
index size.

• Scanning the list sequentially using a longword buffer
requires just a unary read and a few shifts
for each element.

• In general, the high bits/low bits representation concen- 
trates the difficulty of searching and skipping on a simple 
bit array of unary codes containing n ones and at most 
2n zeroes. We can devise extremely fast, practical ad 
hoc techniques that exploit this information. 

Actually, Elias's original paper suggests the most obvious
solution for quick (on average, constant-time) reading of a
sequence of unary codes: we store forward pointers to the
positions (inside the upper-bits array) that one would reach after
$kq$ unary-code reads, $k > 0$, where $q$ is a fixed quantum (in
other words, we record the position immediately after the one
of index $kq - 1$ in the bit array).

Retrieving $x_i$ now can be done by simulating $q\lfloor i/q \rfloor$ unary
reads using a forward pointer, and completing sequentially
with $i \bmod q < q$ unary-code reads. On average, by (1), the
sequential part will read at most $3q$ bits.¹¹ Smaller values of $q$
yield fewer reads and use more space.
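
The following sketch builds the forward pointers for the upper-bits array of the running example and uses them to simulate unary-code reads. It is a simplified illustration: the pointers are kept in a Python list rather than in fixed-width fields, and the function names are hypothetical.

```python
UPPER = "0101101000001"   # upper-bits array of 5, 8, 8, 15, 32 with ell = 2

def build_forward_pointers(upper, q):
    """pointers[k - 1] is the position just after the one of index k*q - 1, k >= 1."""
    pointers, ones = [], 0
    for pos, bit in enumerate(upper):
        if bit == "1":
            ones += 1
            if ones % q == 0:
                pointers.append(pos + 1)
    return pointers

def position_after_reads(upper, pointers, q, i):
    """Position reached after i unary-code reads, jumping with a pointer first."""
    k = i // q
    pos = pointers[k - 1] if k > 0 else 0   # the pointer for k = 0 is implicit
    for _ in range(i % q):                  # fewer than q sequential reads
        pos = upper.index("1", pos) + 1
    return pos

pointers = build_forward_pointers(UPPER, 2)
print(pointers)                                     # [4, 7]
print(position_after_reads(UPPER, pointers, 2, 3))  # 5
```

With $n = 5$ and $q = 2$ there are $\lfloor n/q \rfloor = 2$ pointers.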



⁸ Note that sequences of prefix sums contain an additional element $s_0 = 0$
that is not part of the bijection.

⁹ The same bijection is used normally to code monotone sequences using
gaps, but we intend to do the opposite.

¹⁰ Prefix sums have indeed several applications in compression, for instance
to the storage of XML documents [11].

¹¹ This problem is essentially (i.e., modulo an off-by-one) the
selection problem for which much more sophisticated solutions, starting with
Clarke's [9], have in the last years shown that constant-time access can be
obtained using $o(n)$ additional bits instead of the $O(n)$ bits proposed by
Elias, but such solutions, while asymptotically optimal, have very high
constant costs. Nonetheless, there is a large body of theoretical and practical
knowledge that has been accumulated in the last 20 years about selection, and
we will use some of the products of that research to read multiple unary codes
quickly in the upper-bits array.

Skipping. A more interesting property, for our purposes, is
that by storing skip pointers to positions reached after negated
unary-code reads of the upper bits it is possible to perform
skipping, that is, to find very quickly, given a bound $b$, the
smallest $x_i \ge b$. This operation is fundamental in search
engines as it is the base for quick list intersection.¹²

To see why this is possible, note that by definition in the
upper-bits array the unary code corresponding to the smallest
$x_i \ge b$ must terminate after $\lfloor b/2^\ell \rfloor$ zeroes. We could thus
perform $\lfloor b/2^\ell \rfloor$ negated unary-code reads, getting to position $p$,
and knowing that there are exactly $p - \lfloor b/2^\ell \rfloor$ ones and $\lfloor b/2^\ell \rfloor$
zeroes to our left (i.e., we are in the middle of the unary code
for $x_{p - \lfloor b/2^\ell \rfloor}$). From here, we complete the search
exhaustively, that is, we actually compute the values of the elements
of the list (by reading unary codes and retrieving the suitable
lower bits) and compare them with $b$, as clearly the element we
are searching for cannot be represented earlier in the
upper-bits array. An example is shown in Figure 2.
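
The procedure can be sketched as follows on the running example (5, 8, 8, 15, 32 with $\ell = 2$). No skip pointers are used, so the negated unary-code reads are all performed sequentially; the name `next_geq` and the string model of the bit array are assumptions of this sketch.

```python
ELL = 2
LOWER = [1, 0, 0, 3, 0]
UPPER = "0101101000001"

def next_geq(b):
    """Return (index, value) of the smallest element >= b, or None if there is none."""
    zeros_to_skip = b >> ELL
    pos = 0
    for _ in range(zeros_to_skip):           # negated unary-code reads
        pos = UPPER.find("0", pos) + 1
        if pos == 0:                         # no zero left: b is beyond the list
            return None
    i = pos - zeros_to_skip                  # ones to our left = index of the next element
    while i < len(LOWER):
        stop = UPPER.index("1", pos)         # unary-code read
        value = ((stop - i) << ELL) | LOWER[i]
        if value >= b:
            return i, value
        pos, i = stop + 1, i + 1
    return None

print(next_geq(22))  # (4, 32)
```

For the bound 22 the five negated reads end at position 9, which leaves four ones to the left, and a single unary read then yields the element of index 4.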

By setting up an array of skip pointers analogously to the
previous case (i.e., forward pointers), the reading of negated
unary codes can be performed quickly. Note, however, that in
general without further assumptions it is not possible to bound
the number of bits read during the $\lfloor b/2^\ell \rfloor \bmod q$ negated
unary-code reads that must be performed after following a skip
pointer, as there could be few zeroes (actually, even none) in
the bit array. Nonetheless, if a linear lower bound on the
number of zeroes in the bit array is known, it can be used to show
that skipping is performed in constant time on average.

Strictly monotone sequences. In case the sequence $x_0, x_1,
\ldots, x_{n-1}$ to be represented is strictly monotone (or,
equivalently, the $a_i$'s are nonzero), it is possible to reduce the space
usage by storing the sequence $x_i - i$ using the upper bound
$u - n$. Retrieval happens in the same way: one just has to
adjust the retrieved value for the $i$-th element by adding $i$. This
mechanism was already noted by Elias [12] (more generally
for $k$-spaced sequences, $k > 0$), but it is important to remark
that under this representation the algorithm for skipping will
no longer work. This happens because $x_i$ is actually
represented as $x_i - i$, so skipping $\lfloor b/2^\ell \rfloor$ negated unary codes could
move us arbitrarily after the element we would like to reach.
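
As a small illustration of the transformation (function names are hypothetical), a strictly increasing list becomes a weakly increasing one with smaller values, and adding the index back restores it:

```python
def to_weakly_monotone(xs):
    """Map a strictly increasing list x_0 < x_1 < ... to the list x_i - i."""
    return [x - i for i, x in enumerate(xs)]

def from_weakly_monotone(ys):
    """Inverse mapping: add the index i back to the i-th stored value."""
    return [y + i for i, y in enumerate(ys)]

xs = [3, 4, 7, 8, 9]
ys = to_weakly_monotone(xs)
print(ys)  # [3, 3, 5, 5, 5]
```

The stored values no longer reflect the magnitude of the original elements, which is why counting zeroes in the upper-bits array cannot be used to locate a bound $b$.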



5 Sequences as Ranked Characteristic
Functions

In some cases, the quasi-succinct representation we described
is not very efficient in terms of space: this happens, for
instance, for very dense sequences. There is however an
alternate representation for strictly monotone sequences with
skipping: we simply store a list of $u$ bits in which bit $k$ is set if $k$ is
part of the list $x_0, x_1, \ldots, x_{n-1}$. This is equivalent to storing
the list in gap-compressed form by writing in unary the gaps
$x_i - x_{i-1} - 1$, and guarantees by definition that no more than
$u$ bits will be used.

¹² Elias describes a slightly different analogous operation, by which he finds
the largest $x_i \le b$; the operation involves moving backwards in the bit array,
something that we prefer to avoid for efficiency. Note that this is again
essentially equivalent to predecessor search, a basic problem in fast retrieval
on sets of integers for which very strong theoretical results are known in the
RAM model (f).

Skipping in such a representation is actually trivial: given
the bound $b$, we read a unary code starting at position $b$. The
new position $x_i$ is such that $x_i$ is the smallest element
satisfying $x_i \ge b$. The only problem is that at this point we will have
lost track of the index $i$.

To solve this problem, we take a dual approach to that of 
the previous section and store a simple ranking structure: for 
each position kq, where q is the quantum, we store the number 
of ones to the left. After a skip, we simply rank the current 
position $x_i$ by first reading the precomputed number of ones
before $\lfloor x_i/q \rfloor$, and then computing the number of ones in
the at most $q$ remaining bits.
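
A minimal sketch of a ranked characteristic function follows. It assumes that the bitmap has $u$ bits (so all elements are smaller than $u$) and that one cumulative rank is stored for every position $kq$ including $k = 0$; the names `build` and `skip_to` are illustrative.

```python
def build(xs, u, q):
    """Return the bitmap of xs over u bits and the cumulative ranks at positions k*q."""
    bits = ["0"] * u
    for x in xs:
        bits[x] = "1"
    bitmap = "".join(bits)
    # ranks[k] = number of ones strictly before position k*q
    ranks = [bitmap[:k * q].count("1") for k in range((u + q - 1) // q)]
    return bitmap, ranks

def skip_to(bitmap, ranks, q, b):
    """Return (index, value) of the smallest element >= b, or None."""
    x = bitmap.find("1", b)              # unary read starting at position b
    if x < 0:
        return None
    block = x // q
    index = ranks[block] + bitmap[block * q:x].count("1")   # rank of position x
    return index, x

bitmap, ranks = build([1, 2, 5, 6, 7, 11], u=12, q=4)
print(bitmap, ranks)                 # 011001110001 [0, 2, 5]
print(skip_to(bitmap, ranks, 4, 3))  # (2, 5)
```

The final count touches fewer than $q$ bits, so the index is recovered in time that does not depend on the length of the list.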

6 Representing an Inverted Index 

We now discuss how the quasi-succinct representation pre- 
sented in the previous section can be used to represent the 
posting list of a term. We defer to the next section a detailed 
discussion of the data-storage format. 

Pointers. Document pointers form a strictly monotone
increasing sequence. We store them using the standard
representation (i.e., not the specialized version for strictly monotone
sequences), so as to be able to store skip pointers, as skipping is
a frequent and useful operation (e.g., during the resolution of
conjunctive Boolean queries or phrasal queries), whereas
random access to document pointers is not in general necessary.¹³
The upper bound is the number of documents $N$ minus one,
and the number of elements of the list is $f$, the frequency.

We remark that the apparent loss of compression due to the
necessity of using the standard representation (to make
skipping possible) turns actually into an advantage: if the last
pointer in the list is equal to $\alpha N$, with $0 < \alpha \le 1$, since
$N \ge f$, we can write $N = df + r$ with $d \ge 1$ and $0 \le r < f$,
and then we have

$$\left\lfloor \frac{\alpha N}{2^\ell} \right\rfloor = \left\lfloor \frac{\alpha(df + r)}{2^{\lfloor \log((df+r)/f) \rfloor}} \right\rfloor \ge \left\lfloor \frac{\alpha(df + r)}{2^{\lfloor \log d \rfloor}} \right\rfloor \ge \lfloor \alpha f \rfloor. \qquad (2)$$

In other words, the slight redundancy guarantees that there
are at least $\lfloor \alpha f \rfloor$ zeroes in the upper-bits array: if $\alpha \approx 1$, we
can thus guarantee that on average skipping can be performed
in a constant number of steps, as, on average, reading a one
implies reading at least a zero, too (and vice versa). Since we
write forward pointers only for lists with $f \ge q$, under realistic
assumptions on $q$ in practice $\alpha$ is close to 1.

Finally, even in pathological cases (i.e., a very uneven
distribution of the zeroes in the list), one bit every $2^\ell \le N/f$ bits
must necessarily be zero, as the list is strictly monotone. Thus,
terms with dense posting lists must have frequent zeroes
independently of the considerations above.

¹³ Nothing prevents storing both kinds of pointers. The increase in size
of the index would be unnoticeable.

Figure 2: An example of skipping based on the sequence shown in Figure 1. On the left we have the upper-bits array, and on
the right the lower-bits array. We want to skip to the first item larger than or equal to 22, so since $\ell = 2$ we have to perform
$\lfloor 22/2^2 \rfloor = 5$ negated unary-code reads (the continuous arrows), getting to position 9, so we are positioned in the middle of
the unary code associated with the element of index $9 - 5 = 4$. Then we perform a unary-code read (the dashed arrow), which
returns 3, so we know that the upper bits of the current element (of index 4) are $3 + 5 = 8$. Since the block of lower bits of
index 4 is zero, we return 32. If we had at our disposal a skip pointer for $q = 4$ (the dotted arrow), we could have skipped the
first four negated unary-code reads. Note that in general more than one unary-code read might be necessary after reading the
negated unary codes.

Note that if

$$f + \lfloor N/2^\ell \rfloor + f\ell > N$$

then the representation above uses more than $N$ bits (in
practice, this happens when $f > N/3$). In this case, we switch to
a ranked characteristic function. Since there are at most two
zeroes for each one in the bitmap, it is easy to check that all
operations can still be performed in average constant time.

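
The switching criterion can be sketched as a simple size comparison. The sketch assumes that $\ell$ is computed as $\max\{0, \lfloor \log(N/f) \rfloor\}$, following the displayed inequality; the function names are hypothetical.

```python
def quasi_succinct_bits(N, f):
    """The bound f + floor(N / 2^ell) + f * ell on the size of the representation."""
    ell = max(0, (N // f).bit_length() - 1)   # floor(log2(N / f)), or 0 if f > N
    return f + (N >> ell) + f * ell

def use_bitmap(N, f):
    """True if a plain bitmap of N bits is smaller than the quasi-succinct form."""
    return quasi_succinct_bits(N, f) > N

print(use_bitmap(1000, 400), use_bitmap(1000, 100))  # True False
```
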
Counts. Counts are strictly positive numbers, and can 
be stored using the representation for strictly monotone se- 
quences to increase compression. In this case the upper bound 
is the occurrency of the term, and the number of elements is 
again the frequency. 

Positions. The format for positions is the trickiest one.
Consider, for the $i$-th document pointer in the inverted list for term
$t$ with count $c_i$, the list of positions $p^i_0, p^i_1, \ldots, p^i_{c_i-1}$. First,
we turn this list into a list of strictly positive smaller integers:

$$p^i_0 + 1,\ p^i_1 - p^i_0,\ p^i_2 - p^i_1,\ \ldots,\ p^i_{c_i-1} - p^i_{c_i-2}.$$

Consider the concatenation of all sequences above:

$$p^0_0 + 1,\ p^0_1 - p^0_0,\ \ldots,\ p^0_{c_0-1} - p^0_{c_0-2},\ \ldots,\ p^{f-1}_0 + 1,\ p^{f-1}_1 - p^{f-1}_0,\ \ldots,\ p^{f-1}_{c_{f-1}-1} - p^{f-1}_{c_{f-1}-2} \qquad (3)$$

and store them using the representation for strictly positive
numbers. In this case it is easy to check that the best upper
bound is

$$f + \sum_{0 \le i < f} p^i_{c_i-1} \qquad (4)$$

and the number of elements is the occurrency $g$ of the term.

We now show how to retrieve the positions of the $i$-th
document. Let $s_0, s_1, \ldots, s_f$ be prefix sums of the counts (e.g.,
$c_j = s_{j+1} - s_j$). We note that the list provides the starting and
ending point of the sequence of positions associated to a
document: the positions of document $i$ occur in (3) at positions $j$
satisfying $s_i \le j < s_{i+1}$. Let $t_0, t_1, \ldots, t_g$ be the sequence
of prefix sums of the sequence (3). It is easy to check that the
positions of the $i$-th document can be recovered as follows:

$$p^i_j = t_{s_i + j + 1} - t_{s_i} - 1, \qquad 0 \le j < c_i.$$

We remark that the nice interplay between prefix sums and 
lists of natural numbers is essential in making this machinery 
work: we need the counts $c_j$ (e.g., to compute a content-based
ranking function), but we need also their prefix sums to locate 
positions. 
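
The interplay between the two families of prefix sums can be checked on a toy posting list. In this sketch the prefix sums are stored as plain Python lists instead of quasi-succinct sequences, and the data are invented for illustration.

```python
from itertools import accumulate

positions = [[0, 3, 7], [2], [1, 4]]    # positions of a term in three documents
counts = [len(p) for p in positions]    # the counts c_i

# Sequence (3): first position plus one, then gaps, concatenated over documents.
seq = []
for p in positions:
    seq.append(p[0] + 1)
    seq.extend(b - a for a, b in zip(p, p[1:]))

s = [0] + list(accumulate(counts))      # prefix sums of the counts
t = [0] + list(accumulate(seq))         # prefix sums of sequence (3)

def position(i, j):
    """j-th position of the i-th document: t[s_i + j + 1] - t[s_i] - 1."""
    return t[s[i] + j + 1] - t[s[i]] - 1

print(seq)    # [1, 3, 4, 3, 2, 3]
print(t[-1])  # 16
print([[position(i, j) for j in range(counts[i])] for i in range(3)])
```

The last prefix sum, 16, is the number of documents in the list plus the sum of the last positions (3 + 7 + 2 + 4), and the final line prints the original position lists.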



7 A Quasi-Succinct BitStream 

We now discuss in detail the bit stream used to store the
quasi-succinct representation described in Section 4, in particular
the sizing of all data involved.

Metadata pertaining to the whole representation, if present,
can be stored initially in a self-delimiting format. Then, the
remaining data is laid out as follows: pointers, lower bits,
upper bits (see Figure 3). The rationale behind this layout is that
the upper-bits array is the only part whose length is in
principle unknown: by positioning it at the end of the bit stream, we
do not have to store pointers to the various parts of the stream.
The lower-bits array will be located at position $sw$, where $s$ is
the number of pointers and $w$ their width, and the upper-bits
array at position $sw + n\ell$ bits after the metadata. We can thus
compute without further information the starting point of each
part of the stream.

We assume that the number of elements $n$ is known,
possibly from the metadata. The first issue is thus the size and the
number of pointers. If the upper bound $u$ is known, we know
that the upper-bits array is $n + \lfloor u/2^\ell \rfloor$ bits long at most, so the
width of the pointers is $w = \lceil \log(n + \lfloor u/2^\ell \rfloor + 1) \rceil$;
otherwise, information must be stored in the metadata part so as to be
able to compute $w$.

If we are storing forward pointers for unary codes, the
number of pointers will be exactly $\lfloor n/q \rfloor$; otherwise (i.e., if we are
storing forward pointers for negated unary codes), they will be
at most $s = \lfloor (n + \lfloor u/2^\ell \rfloor)/q \rfloor$.¹⁴ Again, if the bound $u$ is not
known it is necessary to store information in the metadata part
so as to be able to compute $s$.

Analogously, if $u$ is not known we need to store metadata
that makes us able to compute $\ell = \lfloor \log(u/n) \rfloor$.

Finally, in the case of a ranked characteristic function,
instead of pointers we store $\lfloor f/q \rfloor$ cumulative ranks of width
$w = \lceil \log N \rceil$, followed by the bitmap representation of the
characteristic function.
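
The sizing rules for the high bits/low bits layout can be collected in a small helper. This is a sketch under the assumption that $n$, $u$ and $q$ are all known; the function `layout` and its dictionary keys are invented for illustration, and offsets are measured in bits from the end of the metadata.

```python
def ceil_log2(x):
    """Smallest w such that 2**w >= x, for x >= 1."""
    return (x - 1).bit_length()

def layout(n, u, q, negated=False):
    """Sizes and offsets of the three parts of the bit stream."""
    ell = max(0, (u // n).bit_length() - 1)      # lower bits per element
    upper_len = n + (u >> ell)                   # maximum length of the upper-bits array
    w = ceil_log2(upper_len + 1)                 # width of each pointer
    s = upper_len // q if negated else n // q    # number of pointers (a bound if negated)
    lower_start = s * w
    upper_start = lower_start + n * ell
    return {"ell": ell, "w": w, "s": s,
            "lower_start": lower_start, "upper_start": upper_start}

print(layout(5, 36, 2))
```

For the list of Figure 1 with $q = 2$ this gives two 4-bit pointers, the lower-bits array at offset 8 and the upper-bits array at offset 18.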

8 Laying Out the Index Structure 

We now show how to store in a compact format all metadata
that are necessary to access the lists. For each index
component (document pointers, counts, positions) we write a
separate bit stream. We remark that for an index that provides
naturally constant-time access to each element, there is no point
in interleaving data, and this is another advantage of
quasi-succinct encoding, as unnecessary data (e.g., counts and
positions for a Boolean query) need not be examined. As usual, for
each term we store three pointers locating the starting point of
the information related to that term in each stream.

¹⁴ We remark that if $u > x_{n-1}$ some of the $s$ pointers might actually be
unused. It is sufficient to set them to zero (no other pointer can be zero) and
consider them as skips to the end of the list.

Figure 3: The bit stream of a quasi-succinct encoding for a list of $n$ items using $s$ forward pointers. After a self-delimiting
metadata section, there are fixed-width forward pointers, the lower-bits array, and finally the upper-bits array. In this example,
$P_i$ points at the location of the upper-bits array where one would get after $iq$ unary-code reads, with $q = 2$. Pointer $P_0$ is never
stored explicitly.

The bit stream for document pointers contains as metadata
the frequency and the occurrency of the term. We write the
occurrency in γ code and, if the occurrency is greater than one,
the difference between occurrency and frequency, again in γ
code (this ensures that hapaxes use exactly one bit). This
information, together with the number of documents in the
collection, is sufficient to access the quasi-succinct representation
of document pointers (see Section 6).

The bit stream for counts contains no metadata. The occur- 
rency and frequency can be obtained from the pointers stream, 
and they are sufficient to access the representation. 

The bit stream for positions requires storing in the metadata
part the parameter $\ell$ and the skip-pointer size $w$, which
we write again in γ code, as the upper bound (4) is not
available. Note that if the occurrency is smaller than $q$, there is
no pointer, and in that case we omit the pointer size. Thus,
the overhead for terms with a small number of occurrences is
limited to the parameter $\ell$.¹⁵

9 Implementation Details 

Implementation details are essential in a performance-critical 
data structure such as an inverted index. In this section we 
discuss the main ideas used in our implementation. While rel- 
atively simple, these ideas are essential in obtaining, besides 
good compression, a significant performance increase. 

Longword addressing. We either load the index into mem- 
ory, or access it as a memory-mapped region. Access happens 
always by longword, and shifts are used to extract the relevant 
data. The bit $k$ of the index is represented in longword $\lfloor k/64 \rfloor$
in position $k \bmod 64$. While direct access to every point of the
bit stream is possible, we keep track of the current position so
that sequential reads use the last longword read as a bit buffer.
Extraction of lower bits requires very few logical operations
in most cases when $\ell$ is small.
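
The addressing scheme can be sketched as follows, with the bit stream held in a list of 64-bit integers. The sketch assumes that position $k \bmod 64$ counts from the least significant bit of the longword; the name `get_bits` is hypothetical.

```python
def get_bits(words, k, width):
    """Extract `width` bits (width <= 64) starting at bit k of the stream."""
    word, offset = k // 64, k % 64
    value = words[word] >> offset
    if offset + width > 64:                  # the field straddles two longwords
        value |= words[word + 1] << (64 - offset)
    return value & ((1 << width) - 1)

# Lower-bits array of 5, 8, 8, 15, 32 with ell = 2: the fields 1, 0, 0, 3, 0.
words = [1 | (0 << 2) | (0 << 4) | (3 << 6) | (0 << 8)]
print([get_bits(words, 2 * i, 2) for i in range(5)])  # [1, 0, 0, 3, 0]
```

When the field lies within one longword, extraction costs one shift and one mask, which is the common case for small $\ell$.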

¹⁵ Actually, it is easy to check that the overhead for hapaxes is exactly 2 bits
with respect to writing the only existing position in δ code.



Reading unary codes. Reading a unary code is equivalent to
the computation of the least significant bit. We use the
beautiful algorithm based on de Bruijn sequences [18], which is
able to locate the least significant bit using a single
multiplication and a table lookup. The lack of any test makes it a very
good choice on superscalar processors, as it makes prediction
and out-of-order execution possible.
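
The technique can be sketched as follows. The constant below is one commonly used 64-bit de Bruijn sequence, chosen here as an assumption (it is not necessarily the one used in the implementation described in this paper), and the lookup table is derived from it rather than hard-coded. Since Python integers are unbounded, products are masked to 64 bits explicitly.

```python
DEBRUIJN = 0x03F79D71B4CB0A89     # a 64-bit de Bruijn sequence of order 6
MASK64 = (1 << 64) - 1

# The top six bits of (DEBRUIJN << i) are distinct for i = 0 .. 63.
TABLE = [0] * 64
for i in range(64):
    TABLE[((DEBRUIJN << i) & MASK64) >> 58] = i

def lsb(x):
    """Index of the least significant set bit of a nonzero 64-bit word."""
    isolated = x & -x                                    # keep only the lowest one
    return TABLE[((isolated * DEBRUIJN) & MASK64) >> 58]

print(lsb(0b101000), lsb(1 << 63))  # 3 63
```

With bits numbered from the least significant end of the buffer, the value returned by `lsb` is exactly the number of zeroes preceding the next stop bit, that is, the value of the unary code.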

Both when looking up an entry and when skipping, we have,
however, to perform a significant number of unary-code reads (on
average, ≈ q/2). To this purpose, we resort to a broadword
(a.k.a. SWAR, i.e., "SIMD in A Register") bit search [25]. The
idea is that of computing the number of ones in the current bit
buffer using the classical algorithm for sideways addition [17],
which involves few logical operations and a multiplication. If
the number of reads we have to perform exceeds the number of
ones in the current buffer, we examine the next longword, and so
on. Once we locate the right longword, we can complete the
search using the broadword selection algorithm presented
in [25].
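The search can be sketched in Python as follows. The sideways-addition routine is the classical one; the in-word selection, however, is a plain loop used as a stand-in for the broadword selection algorithm of [25], and all function names are assumptions made for this illustration.

```python
MASK64 = (1 << 64) - 1

def popcount64(x):
    """Sideways addition: number of ones in a 64-bit value, using a few
    logical operations and one multiplication."""
    x = x - ((x >> 1) & 0x5555555555555555)
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F
    return ((x * 0x0101010101010101) & MASK64) >> 56

def select_in_word(word, rank):
    """Position of the one of the given 0-based rank inside `word`.
    Simple loop; a real implementation would use broadword selection."""
    for _ in range(rank):
        word &= word - 1  # clear the lowest set bit
    return (word & -word).bit_length() - 1

def skip_unary(longwords, bit_pos, count):
    """Bit position of the count-th one (count >= 1) at or after bit_pos.
    Whole longwords are skipped by counting their ones; raises IndexError
    if the bit stream does not contain enough ones."""
    index = bit_pos >> 6
    # Ignore the bits of the first longword that precede bit_pos.
    word = longwords[index] & (MASK64 << (bit_pos & 63)) & MASK64
    while True:
        ones = popcount64(word)
        if count <= ones:
            return index * 64 + select_in_word(word, count - 1)
        count -= ones
        index += 1
        word = longwords[index]
```

Skipping `count` unary codes thus costs one population count per longword traversed plus a single in-word selection, instead of `count` separate least-significant-bit computations.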

Our experiments show that broadword bit search is ex- 
tremely effective, unless the number of reads is very small, 
as in that case computing iteratively the least significant bit 
becomes competitive. Indeed, when skipping a very small
number of positions (e.g., fewer than eight) we simply resort to
iterating through the list.

Cache the last prefix sum. When retrieving a count or the 
first position of a position list, we have, in theory, to compute 
two associated prefix sums. During sequential scans, however, 
we can cache the last computed value and use it at the next 
call. Thus, in practice, scanning sequentially counts or posi- 
tions requires just one unary-code read and one fixed-width bit 
extraction per item. Reading counts is however made slower 
by the necessity to compute the difference between the current 
and the previous prefix sum. 
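The caching idea can be illustrated with a short Python sketch. A plain list of prefix sums stands in for the quasi-succinct sequence, and the class and its lookup counter are hypothetical names introduced only to make the saving visible.

```python
class CountReader:
    """Recovers counts from their prefix sums, caching the last sum read."""

    def __init__(self, prefix_sums):
        # prefix_sums[i] is the sum of the first i counts (prefix_sums[0] == 0).
        self._prefix = prefix_sums
        self.lookups = 0          # number of prefix sums actually fetched
        self._cached_index = -1
        self._cached_value = 0

    def _prefix_sum(self, i):
        self.lookups += 1
        return self._prefix[i]

    def count(self, i):
        """Return the i-th count as prefix_sums[i + 1] - prefix_sums[i]."""
        if self._cached_index == i:
            lower = self._cached_value   # left over from the previous call
        else:
            lower = self._prefix_sum(i)
        upper = self._prefix_sum(i + 1)
        self._cached_index, self._cached_value = i + 1, upper
        return upper - lower
```

Scanning n counts in order fetches n + 1 prefix sums rather than 2n, while a random access still pays for both; the subtraction in the last line is the extra cost of reading counts mentioned above.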

Trust the processor cache. The cost of accessing an in-memory
index is largely dominated by cache misses. It is thus not
surprising that using a direct access (i.e., by pointer) can be
slower than actually scanning linearly the upper-bits array
using a broadword bit search if our current position is close to
the position to get to. The threshold depends on architectural
issues and must be set experimentally. In our code we use
q = 256 and we do not use pointers if we can skip to the desired
position in fewer than q reads.¹⁷ An analogous strategy is used
with ranked characteristic functions: if we have to skip in the
vicinity of the current position and the current index is known,
we simply read the bitmap, using the sideways addition algorithm
to keep track of the current index.

16 Actually, we first check whether we can compute the least significant bit
using an 8-bit precomputed table, as the guaranteed high density of the upper
bits makes this approach very efficient.

10 Experiments 

We have implemented the quasi-succinct index described in the
previous sections in Java and, for the part related to document
pointers and counts, in C++. All the code used for experiments
is available at the MG4J web site. In this section, we report
some experiments that compare its performance against the
following competitors:

• Lucene, a very popular open-source Java search engine 
(release 3.6.0); 

• the classical high-performance indices from MG4J,
another open-source search engine (release 5.0);

• Zettair, a search engine written in C by the Search Engine
Group at RMIT University;

• the Kamikaze¹⁸ library, implementing the PForDelta [28]
sequence compression algorithm (up-to-date repository
version from GitHub);

• recent optimized C code implementing PForDelta
compression of document pointers and counts, kindly
provided by Ding Shuai [23].

Zettair has been suggested by the TREC organizers as one of 
the baselines for the efficiency track. The comparison of a 
Java engine with a C or C++ engine is somewhat unfair, but
we will see that the Java engines actually turn out to be always
significantly faster.

We use several datasets summarized in Table 1: first, the
classical public TREC GOV2 dataset (about 25 million documents)
and a crawl of around 130 million pages from the .uk domain that
is available from the author. Tokens were defined by transitions
between alphanumerical and nonalphanumerical characters or by
HTML flow-breaking tags, and they were stemmed using the Porter2
stemmer.¹⁹ Besides an index considering the whole HTML document,
we created some indices for the title text only (e.g., the
content of the HTML TITLE element), as such indices have
significantly different statistics (e.g., documents are very
short).

Additionally, we created a part-of-speech index used within the
Mímir semantic engine [10]; such indices have a very small
number of terms that represent syntactic elements (nouns, verbs,
etc.), very dense posting lists and a large number of positions
per posting: they provide useful information about the
effectiveness of compression when the structure of the index is
not that of a typical web text index. For the same reason, we
also index a collection of about a dozen million tweets from
Twitter.

17 Remember, again, that we will actually simulate such reads using a
broadword bit search.

18 http://sna-projects.com/kamikaze/

19 Zettair, however, apparently supports only the original Porter stemmer.

                      Documents   Terms    Postings   Occurrences
TREC GOV2     Text    25 M        35 M     5.5 G      23 G
              Title   25 M        1.1 M    135 M      150 M
Web .uk       Text    130 M       99 M     21 G       62 G
              Title   130 M       3.2 M    609 M      691 M
Mímir index   Token   1 M         49       27 M       1.2 G
Tweets        Text    13 M        2.3 M    147 M      156 M

Table 1: Basic statistics for the datasets used in our experiments.

Small differences in indexing between different search en- 
gines are hard to track: the details of segmentation, HTML 
parsing, and so on, might introduce discrepancies. Thus, we 
performed all our indexing starting from a pre-parsed stream 
of UTF-8 text documents. We also checked that the frequency 
of the terms we use in our queries is the same — a sanity check 
showing that the indexing process is consistent across the en- 
gines. Finally, we checked that the number of results of con- 
junctive and phrasal queries was consistent across the differ- 
ent engines, and that bpref scores were in line with those 
reported by participants to the Terabyte Track. 

Using MG4J, we have created indices that use γ codes for counts,
and either δ or Golomb codes for pointers and positions, endowed
with a mild amount of skipping information using around 1% of
the index size: we chose this value because the same amount of
space is used by our index to store forward and skip pointers
when q = 256. These indices (in particular, the ones based on
Golomb codes) are useful to compare compression ratios: if speed
is not a concern, they provide very good compression, and thus
they provide useful reference points on the compression/speed
curve.

We remark that we have indexed every word of the collec- 
tions. No stopword elimination has been applied. Commercial 
search engines (e.g., Google) are effortlessly able to search for 
the phrase "Romeo and Juliet", so our engine should be able 
to do the same. 

Compression. Table 2 reports a comparison of the compression
ratios. Our quasi-succinct index always compresses better than
γ/δ, but worse than Golomb codes. In practice, our index reduces
the size of the γ/δ index by ≈ 10%, whereas Golomb codes reach
≈ 20%.

20 The Golomb modulus has been chosen separately for each document.
The results we obtain seem to be within 5% of the best compression results
obtained in [26], which suggest a space usage of 21 MB/query on average for
an average of 20.72 million positions per query. A more precise estimate
is impossible, as results in [26] are based on 1000 unknown queries, and no
results about the whole GOV2 collection are provided.

21 We have also tried interpolative coding [20], but on our collections the
difference in compression with Golomb codes was really marginal.

                                 QS        MG4J γ/δ   Golomb    Lucene    Zettair
TREC GOV2 (text)     Pointers    7.42      8.47       6.94      -         -
                     Counts      2.98      2.56       -         -         -
                     Positions   10.17     11.11      8.65      -         -
                     Overall     36.9 GB   40.3 GB    31.9 GB   42.1 GB   40.7 GB
TREC GOV2 (title)    Pointers    10.04     11.44      9.54      -         -
                     Counts      1.10      1.14       -         -         -
                     Positions   3.84      4.63       3.05      -         -
                     Overall     264 MB    308 MB     241 MB    396 MB    395 MB
Web .uk (text)       Pointers    8.46      9.72       7.98      -         -
                     Counts      2.39      2.06       -         -         -
                     Positions   10.16     10.95      8.41      -         -
                     Overall     108 GB    117 GB     92 GB     126 GB    -
Web .uk (title)      Pointers    11.75     13.51      11.27     -         -
                     Counts      1.13      1.18       -         -         -
                     Positions   4.36      5.06       3.35      -         -
                     Overall     1.38 GB   1.59 GB    1.26 GB   2.00 GB   2.15 GB
Mímir token index    Pointers    1.51      1.42       1.48      -         -
                     Counts      6.42      6.28       -         -         -
                     Positions   5.83      6.22       5.03      -         -
                     Overall     0.96 GB   1.01 GB    0.83 GB   1.34 GB   1.36 GB
Tweets               Pointers    10.13     10.29      9.22      -         -
                     Counts      1.06      1.11       -         -         -
                     Positions   4.67      5.94       3.86      -         -
                     Overall     302 MB    341 MB     266 MB    423 MB    484 MB

Table 2: A comparison of index sizes. We show the overall index size, which includes skipping structures, and, if available,
the number of bits per element of each component, excluding skipping structures. A dash marks a value that is not available.

The compression of Lucene and Zettair on the text of web pages
is not very good (a ≈ 15% increase w.r.t. our index). This was
partially to be expected, as both Lucene and Zettair use
variable-length byte codes for efficiency, and while such codes
are easy to decode, they are ill-suited to compression. When the
distribution of terms and positions is different, however,
compression is significantly worse: for titles we have a 50%
increase in size, and for the Mímir semantic index or tweets a
40% increase. This is somewhat typical: variable-length byte
codes compress most positions in a single byte if the
distribution of words comes from a "natural" distribution on
documents of a few thousand words. Using shorter documents
(e.g., titles and tweets) or a different distribution (e.g., a
semantic index) yields very bad results. A 50% increase in
size, indeed, can make a difference.
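The effect of byte alignment on small values can be made concrete with a few lines of Python. This is only an illustration under stated assumptions: the byte code is taken to store 7 payload bits per byte, which is a common layout but not necessarily the exact format used by Lucene or Zettair, and Elias γ is used as a representative bit-oriented code.

```python
def vbyte_bits(x):
    """Bits used by a variable-length byte code with 7 payload bits per
    byte (assumed layout) for an integer x >= 0."""
    return 8 * max(1, -(-x.bit_length() // 7))  # ceil(bit_length / 7) bytes

def gamma_bits(x):
    """Bits used by the Elias gamma code for an integer x >= 1."""
    return 2 * (x.bit_length() - 1) + 1

def total_bits(values, cost):
    """Total size in bits of a list of values under a per-value cost function."""
    return sum(cost(v) for v in values)
```

With these definitions, a list made of very small values, as is typical for counts and positions in titles or tweets, costs 8 bits per element in the byte code regardless of how small the values are, whereas the γ code spends 1 bit on the value 1 and 3 bits on the values 2 and 3. The byte code becomes the shorter of the two only from the value 16 upwards.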

While we are not aiming at the best possible compression,
but rather at high speed, it is nonetheless reassuring to know
that we are improving (as we shall see shortly) both compression
and speed with respect to these engines.

Interestingly, counts are the only index component for which we
sometimes obtain worse results than γ coding. This is somewhat
to be expected, as we are actually storing their prefix sums.
The impact of counts on the overall index, however, is quite
minor, as shown by the small final index size.

Speed. Benchmarking a search engine brings up several complex
issues. In general, the final answer is bound to the
architecture on which the tests were run, and to the type of
queries. A definite answer can be given only against a real
workload.²² Our tests were performed on a recent workstation
sporting a 3.4 GHz Intel i7-3770 CPU with 8 MiB of cache and
16 GiB of RAM.

We aim at comparing speed of in-memory indices, as one 
of the main reasons to obtain smaller indices is to make more 
information fit into memory; moreover, the diffusion of solid- 
state disks makes this approach reasonable. Thus, in our tests 
we resolve each query three times before taking measure- 
ments. In this way we guarantee that the relevant parts of 
the index have been actually read and memory mapped (for 
MG4J and Lucene, or at least cached by the file system, for 
Zettair), and we also make sure that the Java virtual machine 
is warmed up and has performed inlining and other runtime 
optimizations. With this setup, our tests are highly repeatable 
and indeed the relative standard deviation over several runs is 
less than 3%. 
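The measurement protocol can be sketched as follows. This is a hypothetical Python harness, not the authors' benchmarking code: `run_query` is an assumed callable that resolves one query, and the relative standard deviation is computed from the sample standard deviation, which is also an assumption.

```python
import statistics
import time

def timed_after_warmup(run_query, warmup=3):
    """Resolve the query `warmup` times to fill caches and let the runtime
    optimize the code, then return the duration of one further run."""
    for _ in range(warmup):
        run_query()
    start = time.perf_counter()
    run_query()
    return time.perf_counter() - start

def relative_std_dev(samples):
    """Sample standard deviation divided by the mean of the samples."""
    return statistics.stdev(samples) / statistics.mean(samples)
```

Repeating the whole benchmark several times and applying `relative_std_dev` to the resulting timings gives the repeatability figure discussed above.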

We used the 150 TREC Terabyte track (2004-2006) title 
queries in conjunctive, phrasal and proximity form (in the lat- 
ter case, the terms in the query must appear in some order 
within a window of 16 words). We also extracted the terms 
appearing in the queries and used them as queries to mea- 
sure pure scanning speed: all in all, we generated 860 queries. 
MG4J and Lucene were set up to compute the query results 
without applying any ranking function. Zettair was set to
Okapi BM25 ranking [16], which appeared to have the smallest
impact on the query resolution time (no "no-ranking" mode is
available).

All engines were set up to return a single result, so that the
logic needed to keep track of a large result size would not
interfere with the evaluation. The results are shown in Table 3.
The first column (QS) shows the results of query resolution on a
quasi-succinct index, the third column (MG4J) on a γ/δ-coded
high-performance MG4J index, the fourth column on Lucene, and
the last column on Zettair.

22 Note that in real-world search engines the queries that are actually solved
are very different from those input by the user, as they undergo a number of
rewritings. As a consequence, blindly analyzing queries from large query
logs in disjunctive or conjunctive mode cannot give a reliable estimate of the
actual performance of an index.

The second column (QS*) needs some explanation. Both 
Lucene and MG4J interleave document pointers and counts. 
As a consequence, resolving a pure Boolean query has a 
higher cost (as counts are read even if they are not necessary), 
but ranked queries require fewer memory/disk accesses. To
simulate a similar behaviour in our setting, we modified our
code so as to force it to read the count of every returned
document pointer. This setting is of course artificial, but it
provides a good indication of the costs of iterating and
applying a count-based ranking function, and it will be the
basis of our comparison. For phrasal and proximity queries there
is no difference between QS and QS*, as counts have in any case
to be read to access positions.

First of all, we note that decoding a quasi-succinct index is
slightly (≈ 7%) faster than decoding a gap-compressed index that
uses variable-byte codes. It is nonetheless important to notice
that our timings for purely Boolean resolution (QS) are much
lower, and this can be significant in a complex query (e.g., a
conjunction of disjunctively expanded terms). Zettair is much
slower.

More interestingly, we have a ≈ 50% improvement for conjunctive
queries, a ≈ 40% improvement for phrasal queries and a ≈ 60%
improvement for proximity queries: being able to address every
element of the index in average constant time is a real
advantage. We also remind the reader that we are comparing a
Java prototype with a mature implementation.

We expect the asymptotic advantage of quasi-succinct indices to
be more evident as the collection size grows. To test this
hypothesis, we performed further experiments using the Web .uk
collection and 1000 multi-term queries randomly selected from a
large search-engine query log. The results are shown in Table 5:
now conjunctive and proximity queries are more than three times
faster with respect to Lucene.

In Table 4 we show some data comparing in-memory quasi-succinct
indices with PForDelta code. The data we display is constrained
by some limitations: the Kamikaze library does not provide count
storage, and the optimized C code we are using [23] does not
provide positions. This is an important detail, as
quasi-succinct indices trade some additional effort in decoding
counts (i.e., computing their prefix sums) in exchange for
constant-time access to positions. Our main goal is to speed up
positional access: indeed, nothing prevents using PForDelta for
document pointers and storing counts and positions as described
in this paper (or even using a separate PForDelta index without
positions as a first-pass index).

            QS     QS*    MG4J    Lucene   Zettair
Terms       ?      ?      ?       ?        ?
And         1.29   1.79   4.90    3.90     20.92
Phrase      4.00   -      11.01   6.77     21.14
Proximity   4.76   -      12.15   12.05    -

Table 3: Timings in seconds for running the test queries from
the TREC Terabyte track on GOV2 without scoring. The column QS
shows the timings for resolving a query on a quasi-succinct
index, whereas the column QS* shows the timings for a modified
version in which counts are forced to be read for each decoded
document pointer. Measurements were taken after three executions
of each query, with memory map and disk caches already filled.
Note that Zettair is actually reading from disk and scoring the
queries, whereas in the other cases pointers and counts are
being read from a memory-mapped region and no score is being
computed. A question mark denotes a value that is illegible in
the source; a dash denotes a value that is not available.

            QS      QS*     Lucene
Terms       70.9    132.1   130.6
And         27.5    36.7    108.8
Phrase      78.2    -       127.2
Proximity   106.5   -       347.6

Table 5: Timings in seconds for running 1000 randomly selected
queries from a search-engine query log on the Web .uk
collection. See also Table 3.

Kamikaze turns out to be slightly slower for scanning term
lists, and almost twice as slow when computing conjunctive
queries. To estimate the difference in compression, we computed
the space used by the document pointers of our TREC collections
using Kamikaze: the result is an increase of ≈ 55% in space
usage. While not extremely relevant for the index size
(positions are mostly responsible for the size of an index), it
shows that we would gain no advantage from storing pointers
using PForDelta in a Java engine.

The comparison of C implementations, on the other hand, is
definitely in favour of PForDelta: apart from pointer
enumeration our C implementation is slower, in particular when
enumerating terms and their counts.

There are some important caveats, however: the code we have been
provided for PForDelta testing [23] is a bare-bone, heavily
optimised C benchmarking implementation that is able to handle
only 32-bit document pointers and has a number of limitations
such as hardwired constants (e.g., the code needs to be
recompiled if the number of documents in the collection
changes). Our C++ code is a 64-bit fully usable implementation
derived from a line-by-line translation of our Java prototype
code that could certainly be improved by applying CPU-conscious
optimizations. A more realistic comparison would require a real
search engine using PForDelta to solve queries requiring
positional information,²⁴ as it happens in Table 3.



In Table 6 we report similar data for our Web .uk collection:
also in this case, a larger collection improves our results (in
particular, conjunctive queries are only 13% slower than
PForDelta, instead of ≈ 21%).



23 Note that storing positions with PForDelta codes is known to give a
compression rate close to that provided by variable-byte coding [26].

24 Such an engine is not available, to the best of the author's knowledge.
The authors of [26] have refused to make their engine available.





         QS(C)   QS*(C)   PFD(C)   PFD*(C)
Terms    23.8    56.8     23.6     31.6
And      19.2    24.5     16.9     19.4

Table 6: Timings in seconds for running 1000 randomly selected
queries from a search-engine query log on the Web .uk
collection. See also Table 4.

11 Some anecdotal evidence 

While running several queries in controlled conditions is a 
standard practice, provides replicable results and gives a gen- 
eral feeling of what is happening, we would like to discuss 
the result of a few selected queries that highlight the strong 
points of our quasi-succinct indices. We keep the same set- 
tings as in the previous section (e.g., ranked queries repeated 
several times to let the cache do its work). All timings are in 
milliseconds. 

Dense terms. We start by enumerating all documents in which
the term "and" appears (≈ 18 million):

QS     Kamikaze   QS*     MG4J    Lucene   Zettair
72.4   179.2      234.6   488.5   283.6    1246.5



In this case, our quasi-succinct index is a ranked
characteristic function. Reads are particularly fast (just a
unary code read) and, even when combined with count reading,
faster than Lucene. Note that we compress this list at ≈ 1.38
bits per pointer, against the ≈ 2.38 bits of Kamikaze and the 8
bits of Lucene. The slowness of Zettair is probably due to the
fact that positional information is interleaved with document
pointers, so it is necessary to skip over it.

Another example (this time using an Elias-Fano representation)
is the enumeration of all documents in which the term "house"
appears (≈ 2 million):

QS     Kamikaze   QS*    MG4J   Lucene   Zettair
17.2   19.4       31.9   42.2   33.2     69.1



An Elias-Fano list requires recovering also the lower bits, and 
thus it is slightly slower: overall, if we read counts we are just 
slightly faster than Lucene, as expected. 

Conjunction of correlated terms. Consider the conjunction 
of the terms "home" and "page", which appears in about one 
fifth of the documents in the GOV2 collection: 



QS    Kamikaze   QS*   MG4J   Lucene   Zettair
204   295        420   561    416      933



We can see that in this case quasi-succinct indices are already 
better than Kamikaze at conjunction, but nonetheless the high 
correlation makes our constant-time skipping not so useful. 

         QS     QS*    Kamikaze   QS(C)   QS*(C)   PFD(C)   PFD*(C)
Terms    3.83   7.30   4.23       1.61    4.05     1.57     2.39
And      1.16   1.62   2.08       0.91    1.25     0.75     0.87

Table 4: Timings in seconds for running the term and conjunctive test queries from the TREC Terabyte track on GOV2 directly
from RAM. Timings for quasi-succinct indices are provided both for Java and C++ 64-bit implementations. PForDelta timings
have been computed both using the Kamikaze library and using optimized 32-bit C code provided by Ding Shuai [23]. Starred
versions include reading counts for all returned document pointers.

On the other hand, consider the conjunction of the terms
"good", "home" and "page", which appears in about 1/30th of
the documents in the GOV2 collection:

QS   Kamikaze   QS*   MG4J   Lucene   Zettair
73   153        164   471    294      709

The query is now more complex, but, more importantly, there
is a term that is significantly less frequent than the other two.
Quasi-succinct indices have now a significant advantage.



It is interesting to compare the above table with the timings
for the phrasal query "home page":

QS     MG4J   Lucene   Zettair
1282   1693   1228     977

Now the engines essentially have to read all posting lists in
full. No skipping is possible (it would actually be
detrimental). Most of the time is spent trying to figure out
which of the documents containing the query terms actually
contain them in a row. The overhead of Java becomes visible
here: this is indeed our only example in which Zettair is the
fastest engine.

The corresponding timings for the phrasal query "good home
page" are:

QS    MG4J   Lucene   Zettair
540   1251   880      795



Conjunction of uncorrelated terms. The terms "foo" and 
"bar" appear in about 650 000 documents, but they co-occur 
just in about 5 000: 



QS     Kamikaze   QS*    MG4J   Lucene   Zettair
1.27   2.28       2.00   7.09   3.11     35.39



The smallness of the intersection gives to our skipping logic a 
greater advantage than in the previous case. 
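The role of skipping in a sparse intersection can be illustrated with a short Python sketch. It is not the authors' implementation: a binary search over a sorted list stands in for the average constant-time skipping of a quasi-succinct index, and the class and method names are assumptions.

```python
from bisect import bisect_left

class PostingList:
    """Sorted document pointers with a skip operation."""

    def __init__(self, docs):
        self.docs = docs
        self.pos = 0
        self.skips = 0  # number of skip operations performed

    def next_geq(self, target):
        """Move to the first pointer >= target at or after the current
        position and return it, or None if the list is exhausted."""
        self.skips += 1
        self.pos = bisect_left(self.docs, target, self.pos)
        return self.docs[self.pos] if self.pos < len(self.docs) else None

def intersect(lists):
    """Documents appearing in every list, driven by the rarest term."""
    lists = sorted(lists, key=lambda p: len(p.docs))
    rarest, others = lists[0], lists[1:]
    result = []
    candidate = rarest.next_geq(0)
    while candidate is not None:
        target = candidate
        for other in others:
            found = other.next_geq(candidate)
            if found is None:
                return result          # one list is exhausted: we are done
            if found > candidate:
                target = found         # candidate is missing from this list
                break
        if target == candidate:
            result.append(candidate)
            candidate = rarest.next_geq(candidate + 1)
        else:
            candidate = rarest.next_geq(target)
    return result
```

The number of skips on the dense lists is bounded by a small multiple of the length of the rarest list, so the cheaper each skip is, the more the running time depends on the size of the smallest list rather than on the size of the largest.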

The terms "fast" and "slow" appear in about 1 000 000 documents,
but they co-occur just in about 50 000:

QS     Kamikaze   QS*     MG4J    Lucene   Zettair
9.21   10.0       12.45   25.21   17.20    45.22



Complex selective queries. The query "foo bar fast slow" has
≈ 250 results:

QS     Kamikaze   QS*    MG4J   Lucene   Zettair
1.25   2.20       1.32   7.21   7.48     68.26



The more selective the query becomes, the greater the advantage
of average constant-time positioning. Note, in particular, that
the timing for QS* decreases, as fewer counts have to be
retrieved (and they can be retrieved quickly).

Phrases with stopwords. As we remarked in the previous 
section, we should be able to search for the exact phrase 
"Romeo and Juliet": 



QS     MG4J    Lucene   Zettair
2.53   15.12   6.36     1203.85



Zettair performs particularly badly in this case. Our ability to 
address quickly any position of the index more than doubles 
the speed of our answer with respect to Lucene. This can be 
seen also from the timings for the conjunctive query contain- 
ing "Romeo", "and", and "Juliet": 



QS     Kamikaze   QS*    MG4J   Lucene   Zettair
0.51   2.47       0.92   6.64   3.41     1244.03

The number of results increases by ≈ 15%.

Proximity. As Table 3 shows, quick access to positions
significantly improves another important aspect of a search
engine: proximity queries. Here we show a roundup of the
previous conjunctive queries resolved within a window of 16
words:

                    QS        MG4J      Lucene
home page           1625.30   2134.45   2079.52
good home page      754.25    1498.17   1203.64
foo bar             3.22      12.84     7.40
fast slow           23.33     50.68     39.11
foo bar fast slow   1.48      9.15      12.40
romeo and juliet    3.22      16.20     11.41



These results show, in particular, that quick access to positions 
makes proximity computation always faster for more complex 
queries. 
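To make the proximity test concrete, here is a minimal Python sketch of the check performed on each candidate document. It assumes that "within a window of 16 words" means that some span of at most 16 consecutive word positions contains every query term in any order; the exact semantics used by each engine may differ, and the function name is hypothetical.

```python
import heapq

def within_window(position_lists, window=16):
    """True if some span of at most `window` consecutive word positions
    contains at least one position from every (sorted) list."""
    if any(not positions for positions in position_lists):
        return False
    # One cursor per term, ordered by current position.
    heap = [(positions[0], i, 0) for i, positions in enumerate(position_lists)]
    heapq.heapify(heap)
    rightmost = max(positions[0] for positions in position_lists)
    while True:
        leftmost, i, j = heap[0]
        if rightmost - leftmost < window:
            return True
        if j + 1 == len(position_lists[i]):
            return False  # the leftmost cursor cannot advance any further
        following = position_lists[i][j + 1]
        rightmost = max(rightmost, following)
        heapq.heapreplace(heap, (following, i, j + 1))
```

Each step advances the cursor of the term that currently lies leftmost, so every position list is read at most once; the test needs the positions of all query terms for each candidate document, which is why fast access to the start of a position list matters.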

C implementation. Finally, we show the timings for the 
same set of queries using C implementations of PForDelta and 
quasi-succinct indices: 





                    QS(C)    PFD(C)   QS*(C)   PFD*(C)
home page           159.08   134.14   316.91   162.40
good home page      63.06    67.71    121.36   84.18
foo bar             0.73     0.67     1.01     0.83
fast slow           6.34     4.42     8.36     5.04
foo bar fast slow   0.74     0.74     0.82     0.79
romeo and juliet    0.29     0.90     0.56     1.00



We already know from Table 4 that PForDelta optimised code is
significantly faster at retrieving counts (see columns QS*(C)
and PFD*(C)); the same comments apply. As expected, our
quasi-succinct C++ implementation, albeit slower in general, is
faster at solving queries with a mix of high-density and
low-density terms ("good home page" and "romeo and juliet").

12 Conclusions 

We have presented a new inverted index based on the
quasi-succinct encoding of monotone sequences introduced by
Elias and on ranked characteristic functions. The new index
provides better compression than typical gap-encoded indices,
with the exception of extremely compression-oriented techniques
such as Golomb or interpolative coding. When compared with
indices based on gap compression using variable-length byte
encoding (Lucene) or γ/δ codes (MG4J), we provide not only
better compression, but also significant speed improvements on
conjunctive, phrasal and proximity queries. In general, any
search engine accessing positional information for selecting or
ranking documents out of a large collection would benefit from
quasi-succinct indices (an example being tagged text stored in
parallel indices).

Our comparison with a C implementation of PForDelta compression
for pointers and counts showed that PForDelta is slightly faster
than quasi-succinct indices in computing conjunctions, and
significantly faster at retrieving counts, although in queries
mixing terms with high and low frequency quasi-succinct indices
can be much faster. Moreover, PForDelta (more precisely, the
Kamikaze library) uses 55% more space than a quasi-succinct
index to compress pointers from the GOV2 collection.

A drawback of quasi-succinct indices is that some basic
statistics (in particular, frequency, occurrency and the upper
bound) must be known before the index is built. This implies
that to create a quasi-succinct index from scratch it is
necessary to temporarily cache each posting list in turn (e.g.,
using a traditional gap-compressed format) and convert it to the
actual encoding only when all postings have been generated.
While it is easy to do such caching offline, it could slow down
index construction.

On the other hand, this is not a serious problem: in prac- 
tice, large indices are built by scanning incrementally (pos- 
sibly in parallel) a collection, and merges are performed pe- 
riodically over the resulting segments (also called barrels or 
batches). Since during the construction of a segment it is triv- 
ial to store the pieces of information that are needed to build 
a quasi-succinct index, there is no need for an actual two-pass 
construction: segments can be compressed using gap encod- 
ing, whereas large indices can be built by merging in a quasi- 
succinct format. 

Note that if computing the least significant bit,
selection-in-a-word and sideways addition were available in
hardware, the decoding speed of a quasi-succinct index would
significantly increase, as about 30% of the decoding time is
spent reading unary codes. It is difficult to predict the impact
of such hardware instructions on skipping, but we would
certainly expect major speedups. In Java virtual machines, this
would lead to a better intrinsification of methods such as
Long.numberOfTrailingZeros(), whereas the gcc compiler could
provide faster versions of built-in functions such as
__builtin_ctzll().

An interesting area of future research would be extending the
techniques described in this paper to impact-sorted indices, in
which documents are sorted following a retrieval-based impact
order [3], and only document pointers with the same impact are
monotonically increasing. A technique similar to that used in
this paper to store positions (i.e., a different encoding for
the start of each block) might provide new interesting tradeoffs
between compression and efficiency.